16 research outputs found
SiT: Self-supervised vIsion Transformer
Self-supervised learning methods are gaining increasing traction in computer
vision due to their recent success in reducing the gap with supervised
learning. In natural language processing (NLP), self-supervised learning and
transformers are already the methods of choice. The recent literature suggests
that transformers are also becoming increasingly popular in computer
vision. So far, vision transformers have been shown to work well when
pretrained either on large-scale supervised data or with some form of
co-supervision, e.g. from a teacher network. These supervised pretrained
vision transformers achieve very good results in downstream tasks with minimal
changes. In this work we investigate the merits of self-supervised learning for
pretraining image/vision transformers and then using them for downstream
classification tasks. We propose Self-supervised vIsion Transformers (SiT) and
discuss several self-supervised training mechanisms to obtain a pretext model.
The architectural flexibility of SiT allows us to use it as an autoencoder and
work with multiple self-supervised tasks seamlessly. We show that a pretrained
SiT can be finetuned for a downstream classification task on small scale
datasets, consisting of a few thousand images rather than several millions. The
proposed approach is evaluated on standard datasets using common protocols. The
results demonstrate the strength of transformers and their suitability for
self-supervised learning. We outperform existing self-supervised learning
methods by a large margin. We also observe that SiT performs well in few-shot
learning, and show that it learns useful representations: simply training a
linear classifier on top of the features learned by SiT yields strong performance.
Pretraining, finetuning, and evaluation code will be available at:
https://github.com/Sara-Ahmed/SiT
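The abstract describes training SiT as an autoencoder on multiple self-supervised tasks at once. A multi-task pretext objective of this kind can be sketched as a weighted sum of a reconstruction term and a contrastive term. The loss choices, weights, and toy data below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def reconstruction_loss(pred, target):
    """Mean-squared error between reconstructed and original patches."""
    return float(np.mean((pred - target) ** 2))

def contrastive_loss(z1, z2, temperature=0.1):
    """InfoNCE-style loss between two batches of embeddings;
    matching rows (the diagonal) are treated as positive pairs."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature             # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

# Toy arrays standing in for ViT decoder outputs and embeddings.
patches = rng.normal(size=(4, 16))                   # original patch pixels
recon = patches + 0.1 * rng.normal(size=(4, 16))     # reconstruction from masked input
z_a = rng.normal(size=(4, 8))                        # embeddings of view A
z_b = z_a + 0.05 * rng.normal(size=(4, 8))           # embeddings of view B

# Weighted sum of pretext objectives (weights are illustrative).
total = 1.0 * reconstruction_loss(recon, patches) + 0.5 * contrastive_loss(z_a, z_b)
```

Combining objectives this way is what lets a single backbone serve several pretext tasks seamlessly.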
LT-ViT: A Vision Transformer for multi-label Chest X-ray classification
Vision Transformers (ViTs) are widely adopted in medical imaging tasks, and
some existing efforts have been directed towards vision-language training for
Chest X-rays (CXRs). However, we believe there is still room to improve
vision-only training for CXRs using ViTs, by aggregating
information from multiple scales, which has been proven beneficial for
non-transformer networks. Hence, we have developed LT-ViT, a transformer that
utilizes combined attention between image tokens and randomly initialized
auxiliary tokens that represent labels. Our experiments demonstrate that LT-ViT
(1) surpasses the state-of-the-art performance using pure ViTs on two publicly
available CXR datasets, (2) is generalizable to other pre-training methods and
therefore is agnostic to model initialization, and (3) enables model
interpretability without Grad-CAM and its variants.
Comment: 5 pages, 2 figures
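The combined attention between image tokens and randomly initialized label tokens can be sketched with a minimal toy: label tokens are prepended to the patch tokens, attention mixes the two groups, and each updated label token is read out as a per-label logit. This is a single-head sketch with no learned projections, not LT-ViT itself:

```python
import numpy as np

rng = np.random.default_rng(0)

def self_attention(x):
    """Single-head self-attention (Q/K/V projections omitted for brevity)."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x

num_patches, num_labels, dim = 16, 5, 32
patch_tokens = rng.normal(size=(num_patches, dim))        # from the image
label_tokens = 0.02 * rng.normal(size=(num_labels, dim))  # randomly initialized

# Combined attention: label tokens attend to image tokens and vice versa.
tokens = np.concatenate([label_tokens, patch_tokens], axis=0)
tokens = self_attention(tokens)

# Each updated label token yields one logit; sigmoid for multi-label output.
w_out = rng.normal(size=(dim,))       # toy readout weights
logits = tokens[:num_labels] @ w_out
probs = 1 / (1 + np.exp(-logits))     # one probability per label
```

Because each label has its own token, the attention weights from a label token over the patch tokens give a built-in saliency map, which is what enables interpretability without Grad-CAM.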
Masked Momentum Contrastive Learning for Zero-shot Semantic Understanding
Self-supervised pretraining (SSP) has emerged as a popular technique in
machine learning, enabling the extraction of meaningful feature representations
without labelled data. In the realm of computer vision, pretrained vision
transformers (ViTs) have played a pivotal role in advancing transfer learning.
Nonetheless, the escalating cost of finetuning these large models has posed a
challenge due to the explosion of model size. This study endeavours to evaluate
the effectiveness of pure self-supervised learning (SSL) techniques in computer
vision tasks, obviating the need for finetuning, with the intention of
emulating human-like capabilities in generalisation and recognition of unseen
objects. To this end, we propose an evaluation protocol for zero-shot
segmentation based on a prompting patch. Given a point on the target object as
a prompt, the algorithm computes a similarity map between the selected
patch and all other patches; a simple threshold is then applied to segment
the target. A second evaluation measures intra-object and inter-object similarity to
gauge the discriminatory ability of SSP ViTs. Insights from prompt-based zero-shot
segmentation and from the discriminatory abilities of SSP led to the design of a
simple SSP approach, termed MMC. This approach combines Masked image
modelling for encouraging similarity of local features, Momentum based
self-distillation for transferring semantics from global to local features, and
global Contrast for promoting semantics of global features, to enhance
discriminative representations of SSP ViTs. Consequently, our proposed method
significantly reduces the overlap of intra-object and inter-object
similarities, thereby facilitating effective object segmentation within an
image. Our experiments reveal that MMC delivers top-tier results in zero-shot
semantic segmentation across various datasets.
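The prompt-based zero-shot segmentation protocol described above (a similarity map against a prompted patch, followed by thresholding) can be sketched as follows; the toy patch features and the threshold value are illustrative assumptions:

```python
import numpy as np

def zero_shot_segment(patch_feats, prompt_idx, threshold=0.5):
    """Segment by cosine similarity to a single prompted patch:
    every patch whose similarity exceeds the threshold joins the mask."""
    f = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    sim = f @ f[prompt_idx]   # similarity of every patch to the prompt
    return sim >= threshold   # boolean mask over patches

# Deterministic toy features: patches 0-3 lie near one direction (the
# object), patches 4-7 near an orthogonal direction (the background).
feats = np.array([
    [1.00, 0.00, 0.10], [0.90, 0.10, 0.00], [1.00, 0.10, 0.10], [0.95, 0.00, 0.05],
    [0.00, 1.00, 0.10], [0.10, 0.90, 0.00], [0.00, 1.00, 0.00], [0.05, 1.00, 0.10],
])
mask = zero_shot_segment(feats, prompt_idx=0, threshold=0.5)
```

Note that the protocol works only if intra-object similarities sit well above inter-object ones, which is exactly the gap MMC is designed to widen.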
Deep Convolutional Neural Network Ensembles using ECOC
Deep neural networks have enhanced the performance of decision making systems
in many applications including image understanding, and further gains can be
achieved by constructing ensembles. However, designing an ensemble of deep
networks is often not very beneficial since the time needed to train the
networks is very high or the performance gain obtained is not very significant.
In this paper, we analyse the error correcting output coding (ECOC) framework
as an ensemble technique for deep networks and propose different design
strategies to address the accuracy-complexity trade-off. We carry out an
extensive comparative study between the introduced ECOC designs and the
state-of-the-art ensemble techniques such as ensemble averaging and gradient
boosting decision trees. Furthermore, we propose a combinatory technique which
is shown to achieve the highest classification performance amongst all.
Comment: 13 pages, double-column IEEE Transactions style
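The core ECOC mechanism, encoding classes as binary codewords and decoding base-classifier outputs by minimum Hamming distance, can be sketched as follows. The code matrix is a toy illustration, not one of the designs studied in the paper:

```python
import numpy as np

# Toy ECOC code matrix: rows are classes, columns are binary dichotomies.
# Each base network is trained on one column's binary problem.
code_matrix = np.array([
    [0, 0, 1, 1, 0],   # class 0
    [0, 1, 0, 1, 1],   # class 1
    [1, 0, 0, 0, 1],   # class 2
    [1, 1, 1, 0, 0],   # class 3
])

def ecoc_decode(bit_predictions, code_matrix):
    """Return the class whose codeword has minimum Hamming distance
    to the vector of base-classifier outputs."""
    dists = np.abs(code_matrix - bit_predictions).sum(axis=1)
    return int(np.argmin(dists))

# Base classifiers output [1, 0, 0, 1, 1]: one bit flipped relative to
# class 2's codeword [1, 0, 0, 0, 1], yet decoding still recovers class 2.
predicted = ecoc_decode(np.array([1, 0, 0, 1, 1]), code_matrix)
```

The error-correcting property comes from the pairwise Hamming distance between codewords: the larger the minimum distance, the more base-classifier mistakes decoding can absorb.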
Skin lesion classification with deep CNN ensembles
Early detection of skin cancer is vital when treatment is most likely to be successful. However, diagnosis of skin lesions is a very challenging task due to the similarities between lesions in terms of appearance, location, color, and size. We present a deep learning method for skin lesion classification by fusing and fine-tuning three pre-trained deep learning architectures (Xception, Inception-ResNet-V2, and NasNetLarge) using training images provided by the ISIC2019 organizers. Additionally, the outliers and the heavy class imbalance are addressed to further enhance lesion classification. The experimental results show that the proposed framework obtained promising results that are comparable with the ISIC2019 challenge leaderboard.
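Fusing several fine-tuned networks can be sketched as simple averaging of their softmax outputs; the optional class re-weighting shown here is one illustrative way to counter class imbalance, not necessarily the paper's exact scheme:

```python
import numpy as np

def fuse_predictions(prob_list, class_weights=None):
    """Average softmax outputs from several fine-tuned networks.
    class_weights can up-weight rare lesion classes (illustrative)."""
    probs = np.mean(prob_list, axis=0)
    if class_weights is not None:
        probs = probs * class_weights
        probs = probs / probs.sum(axis=1, keepdims=True)  # renormalize
    return probs

# Toy softmax outputs from three backbones for two images, three classes.
p_xception = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
p_incres   = np.array([[0.5, 0.4, 0.1], [0.1, 0.6, 0.3]])
p_nasnet   = np.array([[0.7, 0.2, 0.1], [0.3, 0.4, 0.3]])

fused = fuse_predictions([p_xception, p_incres, p_nasnet])
labels = fused.argmax(axis=1)   # final per-image class decisions
```

Averaging reduces the variance of any single backbone's errors, which is the usual motivation for fusing architecturally diverse networks.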
Deep learning ensembles for image understanding
Deep neural networks have enhanced the performance of decision making systems in many applications, including image understanding. Further performance gains can be achieved by using ensemble methods, which are shown to be powerful tools for various classification and regression tasks. This dissertation consists of two parts. The first part is devoted to studying the face attributes classification problem. We introduce several novel approaches for this problem, achieving state-of-the-art results on the CelebA and LFWA datasets: i) we use the multi-task learning (MTL) framework for multiple attributes classification for scalability, where base learners are grouped according to the location of the attribute on the face and share weights. Giving the location of an attribute as prior information is shown to speed up the learning process and lead to increased accuracy. ii) we introduce a novel ensemble learning technique within the deep learning model itself (within-network ensemble), showing increased performance at almost the same time complexity as a single model. iii) we propose a new framework called Deep-RankSVM for relative attribute classification (comparing the strength of an attribute's expression in two photographs), adapting the SVM formulation to deep rank learning. The second part is devoted to analyzing the suitability of different state-of-the-art design strategies for constructing ensembles of deep networks. We propose the Error Correcting Output Codes (ECOC) framework as a novel deep learning ensemble method, and show that it can be used with the MTL framework for an arbitrary accuracy-complexity trade-off. We carry out an extensive comparative study between the introduced ECOC designs and state-of-the-art ensemble techniques such as ensemble averaging and gradient boosting decision trees, on several datasets.
In the rest of the dissertation, we discuss general applications of the proposed ensemble techniques, including skin lesion classification and plant identification.
Relative attribute classification with Deep-RankSVM
Relative attributes indicate the strength of a particular attribute between image pairs. We introduce a deep Siamese network with a rank SVM loss function, called Deep-RankSVM, that can decide which one of a pair of images has a stronger presence of a specific attribute. The network is trained in an end-to-end fashion to jointly learn the visual features and the ranking function. The trained network for an attribute can predict the relative strength of that attribute in novel images. We demonstrate the effectiveness of our approach against the state-of-the-art methods on four image benchmark datasets: LFW-10, PubFig, UTZap50K-2, and UTZap50K-lexi. Deep-RankSVM surpasses the state-of-the-art in terms of average accuracy across attributes on three of the four benchmark datasets.
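A rank SVM loss on a Siamese pair can be sketched as a hinge loss on the score difference of the two images; the margin value and the exact formulation below are illustrative assumptions, not necessarily the paper's precise loss:

```python
import numpy as np

def ranksvm_loss(score_a, score_b, target, margin=1.0):
    """Hinge loss on the score difference of an image pair.
    target = +1 if image A should rank above image B, -1 otherwise.
    Zero loss once the pair is ordered correctly by at least `margin`."""
    return float(np.maximum(0.0, margin - target * (score_a - score_b)))

# Scores produced by the two branches of a Siamese network (toy values).
loss_correct = ranksvm_loss(2.0, 0.5, target=+1)  # correctly ranked by > margin
loss_wrong   = ranksvm_loss(0.5, 2.0, target=+1)  # wrong order: penalized
```

Because the loss depends only on the score difference, both branches can share weights and the whole pipeline is trainable end-to-end, which is the key property of the Siamese formulation.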
Relative attributes classification via transformers and rank SVM loss
We propose a new model for learning to rank two images with respect to their relative strength of expression for a given attribute. We address this problem, called relative attribute learning, using a vision transformer backbone. The embedded representations of the two images to be compared are extracted and used for comparison with a ranking head, in an end-to-end fashion. The results demonstrate the strength of vision transformers and their suitability for relative attributes classification. Our proposed approach outperforms the state-of-the-art by a large margin, achieving 90.40% and 98.14% mean accuracy over the attributes of the LFW-10 and PubFig datasets.